Creator: Maggie Xu

Q1: How have arrest numbers changed over time (monthly, quarterly, or yearly)? Are there noticeable spikes or declines (e.g., during COVID, summer months)?

Question 1a:Yearly Trend

Arrests %>%
  count(year) %>%
  ggplot(aes(x = year, y = n)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Yearly Arrests",
    x = "Year", y = "Number of Arrests"
  ) +
  theme_minimal()

observations:

- Arrests start high in ~2010, with counts over 3,500 per year.
- Arrest numbers remain relatively stable between ~2010 and 2016, fluctuating slightly between ~3,200–3,600.
- Beginning around 2017–2018, there is a noticeable decline in arrests.
- The drop becomes much sharper around 2019–2020, falling to nearly half of earlier levels — this likely corresponds with the COVID-19 pandemic and lockdowns, when many cities saw decreased police activity and fewer arrests.
- After 2020, there’s a small rebound, but the counts remain significantly lower than pre-2017 levels.

Question 1b: Monthly Trend

Arrests %>%
  count(year, month_num) %>%
  mutate(YearMonth = as.Date(paste(year, month_num, "01", sep = "-"))) %>%
  ggplot(aes(x = YearMonth, y = n)) +
  geom_line() +
  labs(title = "Monthly Arrests Over Time", x = "Year-Month", y = "Number of Arrests") +
  scale_x_date(date_breaks = "6 months", date_labels = "%Y-%m") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  theme_minimal()

Observations:

2009–2016: Arrests fluctuate between ~250–400 per month, with periodic peaks (approaching ~500) and valleys. There appears to be some seasonality: higher counts in some mid-year (likely summer) months, lower in others (likely winter).

2017–2018: Monthly counts start to decline gradually compared to earlier years. The peaks are less pronounced, staying mostly below 350.

2019–2020: Sharp decline begins late 2019 and early 2020 — arrests drop rapidly to below 200 per month. This drop aligns with the COVID-19 pandemic and lockdowns, when many jurisdictions reduced arrests.

2021–2025: Counts remain low (~100–150 per month) compared to pre-2019 levels. There are occasional small spikes, but they stay much lower than earlier years, suggesting a lasting reduction.

HEAT MAP

Arrests %>%
  count(year, month) %>%
  ggplot(aes(x = month, y = factor(year), fill = n)) +
  geom_tile() +
  scale_fill_viridis_c() +
  labs(
    title = "Heatmap of Arrests by Month and Year",
    x = "Month", y = "Year", fill = "Arrests"
  )+
  theme_minimal()

Observations: 
2010–2016:
Arrest levels are high overall (green to yellow hues) with a clear concentration in summer months (June–August) — particularly July & August, which show the most intense (yellow) spots.
Winter months (January–February, December) consistently show lower levels (blueish shades), which is expected due to seasonality.

2017–2018:
Arrests begin to decline slightly — the heatmap colors shift towards cooler greens and blues even in summer months, suggesting fewer arrests compared to earlier years.

2019–2020:
A dramatic drop appears starting in 2020 — most months, including summer, turn dark blue indicating very low arrest counts.
The drop is most pronounced in spring and summer 2020 — coinciding with COVID-19 restrictions.

2021–2024:
Arrest levels remain low (blue to dark blue), and the distinct summer peaks almost disappear.
The seasonality seems flattened — suggesting the usual summer increase has weakened post-pandemic.

Seasonal pattern:
Consistent before 2017:
- Peaks: June–August
- Lows: January–February & December
Post-2020, this pattern weakens.

Boxplot: Distribution of arrests by month

# Prepare data
monthly_counts <- Arrests %>%
  count(year, month) %>%
  mutate(
    month_num = as.numeric(month),
    # assign season
    season = case_when(
      month_num %in% c(12, 1, 2)  ~ "Winter",
      month_num %in% c(3, 4, 5)   ~ "Spring",
      month_num %in% c(6, 7, 8)   ~ "Summer",
      month_num %in% c(9, 10, 11) ~ "Fall"
    )
  ) %>%
  filter(!is.na(season))   %>%
 mutate(season = droplevels(factor(season)))

# Define seasonal colors
season_colors <- c(
  "Winter" = "skyblue3",
  "Spring" = "springgreen3",
  "Summer" = "indianred",
  "Fall"   = "gold"
)

# Plot
p <- ggplot(monthly_counts, aes(x = month, y = n, fill = season, color = season, group = month)) +
  geom_boxplot() +
  scale_y_continuous(breaks = seq(0, 500, 50)) +
  scale_fill_manual(values = season_colors) +
  scale_color_manual(values = season_colors) +
  labs(
    title = 'Distribution of Monthly Arrests by Month',
    subtitle = 'Year: {current_frame}',
    x = 'Month', y = 'Number of Arrests', fill = "Season", color = "Season"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.subtitle = element_text(size = 18, face = "bold", hjust = 0.5)
  )

# Animate with transition_manual
anim <- p +
  transition_manual(year)

# save animation: 
# animate(anim, width = 900, height = 600, fps = 1, duration = length(unique(monthly_counts$year)) * 4, renderer = gifski_renderer())
# anim_save("animated_boxplot_monthly_arrests_synced.gif")

General Trends:
The boxplots show clear seasonality:
 Winter (Jan–Feb, Dec) — lowest arrests, consistently around ~200–250.
 Spring (Mar–May) — arrests start increasing, reaching ~300–350.
 Summer (Jun–Aug) — peaks, especially July & August, with arrests reaching ~400–500.
 Fall (Sep–Nov) — begins to decline from summer peaks, stabilizing around ~300.

The seasonal pattern is most visible and consistent in the earlier years (2010–2016).
#checking why no spring and summer in 2020
# List of all years & months in your data

all_months <- month.abb

# All year-month combinations
all_combinations <- expand_grid(
  year = unique(Arrests$year),
  month = all_months
)

# Actual counts
actual_counts <- Arrests %>%
  count(year, month)

# Convert both month columns to character explicitly
all_combinations <- all_combinations %>%
  mutate(month = as.character(month))

actual_counts <- actual_counts %>%
  mutate(month = as.character(month))

# Left join + fill missing with 0
complete_counts <- all_combinations %>%
  left_join(actual_counts, by = c("year", "month")) %>%
  mutate(n = replace_na(n, 0)) %>%
  arrange(year, match(month, all_months))

# Show missing months
missing_months <- complete_counts %>% filter(n == 0)

print(missing_months)
## # A tibble: 20 × 3
##     year month     n
##    <dbl> <chr> <int>
##  1  2021 Apr       0
##  2  2021 May       0
##  3  2021 Jun       0
##  4  2021 Jul       0
##  5  2021 Aug       0
##  6  2021 Sep       0
##  7  2024 May       0
##  8  2024 Jun       0
##  9    NA Jan       0
## 10    NA Feb       0
## 11    NA Mar       0
## 12    NA Apr       0
## 13    NA May       0
## 14    NA Jun       0
## 15    NA Jul       0
## 16    NA Aug       0
## 17    NA Sep       0
## 18    NA Oct       0
## 19    NA Nov       0
## 20    NA Dec       0

Seasonal barplot

Arrests %>%
  count(year, season) %>%
  filter(!is.na(season)) %>% 
  ggplot(aes(x = factor(year), y = n, fill = season)) +
  geom_bar(stat = "identity", position = "dodge") + 
  scale_fill_manual(values = season_colors) +
  theme_minimal()+
  labs(
    title = "Arrests by Season and Year",
    x = "Year", y = "Number of Arrests", fill = "Season"
  )

General Observations:
1. In all years, Summer shows the highest number of arrests, peaking around ~1,200 in 2010–2016.
2. Fall and Spring are similar to each other, generally slightly below Summer but still high.
3. Winter consistently has the lowest number of arrests each year — roughly ~50–70% of Summer.

Trend Over Time:
Arrest numbers declined steadily from ~2015 to 2019 across all seasons.
- Summer arrests dropped from ~1,100 in 2014 to ~600–700 by 2019.
- Fall and Spring show a similar downward trend.
- Winter remained the lowest but also declined.

Impact of COVID-19:
In 2020–2021, a dramatic drop is visible in all seasons:
Winter remains low.
Summer and Fall drop sharply, to about ~300 or lower — nearly ¼ of the pre-pandemic peak.
Spring also falls dramatically.
This aligns with what we observed in earlier plots: the pandemic significantly reduced arrests.

Post-2021 (pandemic ends in May 2023):
In 2022–2024, some recovery is observed:
- Slight uptick in Spring and Fall arrests.
- Summer remains much lower than its pre-2020 highs.
- Overall, the seasonal differences are less pronounced in recent years than before 2020.

Seasonal Patterns throughout the years:
Summer > Fall ≈ Spring > Winter

Calendar Heatmap

# Prepare data
daily_arrests <- Arrests %>%
  filter(!is.na(ArrestDate)) %>%
  count(ArrestDate) %>%
  mutate(year = lubridate::year(ArrestDate))

# Get years
years <- sort(unique(daily_arrests$year))

# Loop: save PNG for each year
for (yr in years) {
  cat("Rendering year:", yr, "\n")
  
  p <- ggplot_calendar_heatmap(
      daily_arrests %>% filter(year == yr),
      'ArrestDate',
      'n'
    ) +
    scale_fill_gradient(low = "white", high = "red", name = "Arrests", na.value = "grey") +
    labs(
      title = sprintf("Calendar Heatmap of Daily Arrests — Year: %s", yr),
      subtitle = "Darker colors indicate more arrests, grey = missing data"
    ) +
    theme_minimal(base_size = 14)
  
 # ggsave(sprintf("calendar_%s.png", yr), plot = p, width = 12, height = 8)
}
## Rendering year: 2010 
## Rendering year: 2011 
## Rendering year: 2012 
## Rendering year: 2013 
## Rendering year: 2014 
## Rendering year: 2015 
## Rendering year: 2016 
## Rendering year: 2017 
## Rendering year: 2018 
## Rendering year: 2019 
## Rendering year: 2020 
## Rendering year: 2021 
## Rendering year: 2022 
## Rendering year: 2023 
## Rendering year: 2024
imgs <- list.files(
  path = "heat map calendar for each year",
  pattern = "\\.png$",
  full.names = TRUE
) |> sort()

length(imgs)
## [1] 0
output_gif <- "calendar_heatmap_animation.gif"

# gifski(
#  png_files = imgs,
#  gif_file = output_gif,
#  delay = 2,   # adjust speed (seconds per frame)
#  width = 1200,
#  height = 800,
#  loop = TRUE
#)

Q2: How does the frequency and type of crime change across months across different ZIP codes?

Incidents <- incidents %>%
  mutate(
    Date = as.Date(Occur_Date),           # adjust as needed
    Month = month(Date, label = TRUE),
    Year = year(Date)
  ) %>%
  filter(!is.na(Latitude) & !is.na(Longitude)) # keep valid rows

Loops of ARREST_COUNT 2010-2024 each month

# create output folder if it doesn't exist
if (!dir.exists("maps_2010")) dir.create("maps_2010")

# loop through months 1 to 12
for (m in 1:12) {
  month_name <- month.abb[m]  # "Jan", "Feb", etc.
  
  freq_grid <- Arrests %>%
    filter(year == 2010, month == month_name) %>% 
    mutate(
      rlat = round(latitude * 200) / 200,
      rlon = round(longitude * 200) / 200
    ) %>%
    group_by(rlat, rlon) %>%
    summarize(n = n(), .groups = "drop")
  
  if (nrow(freq_grid) == 0) next
  
  p <- ggplot() +
    geom_tile(data = freq_grid, aes(x = rlon, y = rlat, fill = n)) +
    geom_path(data = ch, aes(x = V1, y = V2), color = "black") +
    coord_quickmap(xlim = c(-79.09, -78.99), ylim = c(35.86, 35.99)) +
    labs(
      title = "Arrest Frequency in Chapel Hill",
      subtitle = sprintf("2010 - Month: %s", month_name),
      fill = "Arrest Count",
      x = "Longitude",
      y = "Latitude"
    ) +
    scale_fill_viridis_c() +
    theme_minimal()
  
#  ggsave(filename = sprintf("maps_2010/arrests_2010_%s.png", month_name),plot = p, width = 8, height = 6)
  
}


# get all PNGs recursively
pngs <- list.files(
  path = "yearly_maps", 
  pattern = "\\.png$", 
  recursive = TRUE, 
  full.names = TRUE
)

# sort files by year and month
pngs_sorted <- pngs[order(pngs)]  # assumes your files are named arrests_YYYY_MMM.png

# order chronologically
month_order <- setNames(1:12, month.abb)

pngs_sorted <- pngs %>%
  tibble(path = .) %>%
  mutate(
    fname = basename(path),
    year = as.numeric(stringr::str_extract(fname, "\\d{4}")),
    month_str = stringr::str_extract(fname, paste(month.abb, collapse = "|")),
    month_num = month_order[month_str]
  ) %>%
  arrange(year, month_num) %>%
  pull(path)

#gifski(
#  png_files = pngs_sorted,
#  gif_file = "all_years_arrests.gif",
#  width = 1200,
#  height = 800,
#  delay = 2,   # seconds per frame
#  loop = TRUE
#)

# Interpreter: Sarah Bazari

Q1: Which zip code in Chapel Hill experiences the most arrests? And accordingly, what classifications of offenses are most common to each zip code?

# filters just for chapel hill
chapel_hill_arrests = Arrests %>%
  filter(City == "CHAPEL HILL")
# count arrests per zipcode
zip_counts = chapel_hill_arrests %>%
  count(Zip, sort = TRUE)

zip_counts
## # A tibble: 5 × 2
##     Zip     n
##   <dbl> <int>
## 1 27514 22774
## 2 27516 11808
## 3 27517  2705
## 4 27515    18
## 5 27599     4
# count most common charges per zipcode
offense_by_zip = chapel_hill_arrests %>%
  count(Zip, Charge, sort = TRUE)

# see the top offenses per zip
offense_by_zip %>%
  group_by(Zip) %>%
  slice_max(n, n = 3) %>% # top 3 per zip
  arrange(Zip, desc(n))
## # A tibble: 19 × 3
## # Groups:   Zip [5]
##      Zip Charge                                          n
##    <dbl> <chr>                                       <int>
##  1 27514 IMPAIRED DRIVING DWI                         2458
##  2 27514 FAIL TO APPEAR/COMPL                         2263
##  3 27514 OPEN CONTAINER                               1565
##  4 27515 INJURY TO TREES/LANDCAPE  M                     4
##  5 27515 ASSAULT ON FEMALE                               2
##  6 27515 ASSAULT ON GOVERMENT OFFICIAL                   2
##  7 27515 DRIVING WHILE IMPAIRED                          2
##  8 27515 FAIL TO APPEAR/COMPL                            2
##  9 27515 IMPAIRED DRIVING DWI                            2
## 10 27515 LARCENY -  ALL OTHER                            2
## 11 27515 OPEN CONTAINER                                  2
## 12 27516 FAIL TO APPEAR/COMPL                         1108
## 13 27516 OPEN CONTAINER                               1001
## 14 27516 IMPAIRED DRIVING DWI                          867
## 15 27517 IMPAIRED DRIVING DWI                          369
## 16 27517 FAIL TO APPEAR/COMPL                          258
## 17 27517 ASSAULT ON FEMALE                             128
## 18 27599 DRUG PARAPHERNALIA                              2
## 19 27599 POSS OR MANUFACTURE  FRAUDULENT FORMS OF ID     2
# visualiztion bar plot

ggplot(zip_counts, aes(reorder(as.character(Zip), -n), y = n)) +
  geom_col(fill = "#E63946", width = 0.7) +  # Nice deep red tone
  labs(
    title = "Number of Arrests by Zip Code in Chapel Hill",
    subtitle = "Zip codes with the highest arrest counts shown in descending order",
    x = "Zip Code",
    y = "Number of Arrests"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, color = "gray40"),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(face = "bold"),
    plot.margin = margin(10, 20, 10, 10)
  )

Q2: What types of charges are most likely to involve weapons, and does weapon presence vary by race or gender?

# most common charges for each weapon type

charges_by_weapon = Arrests %>%
  filter(!is.na(Weapon)) %>%
  group_by(Weapon, Charge) %>%
  summarize(Count = n(), .groups = "drop") %>%
  arrange(Weapon, desc(Count))

head(charges_by_weapon, 10)
## # A tibble: 10 × 3
##    Weapon                        Charge                                 Count
##    <chr>                         <chr>                                  <int>
##  1 CLUB/BLACKJACK/BRASS KNUCKLES ASSAULT W DEADLY WEAPON                    4
##  2 CLUB/BLACKJACK/BRASS KNUCKLES PHYSICAL SIMPLE ASSAULT-NON AGGRAVATED     4
##  3 CLUB/BLACKJACK/BRASS KNUCKLES ADW-OTHER WEAPON                           3
##  4 CLUB/BLACKJACK/BRASS KNUCKLES CCW                                        3
##  5 CLUB/BLACKJACK/BRASS KNUCKLES ASLT INFL BODY INJURY  F                   2
##  6 CLUB/BLACKJACK/BRASS KNUCKLES ASSAULT & BATTERY                          2
##  7 CLUB/BLACKJACK/BRASS KNUCKLES ASSAULT ON FEMALE                          2
##  8 CLUB/BLACKJACK/BRASS KNUCKLES ASSAULT ON GOVERNMENT OFFICIAL             2
##  9 CLUB/BLACKJACK/BRASS KNUCKLES ASSAULT-SIMPLE                             2
## 10 CLUB/BLACKJACK/BRASS KNUCKLES AWIK/SERIOUS INJURY                        2
# weapon presence by race and gender

weapon_by_race_gender = Arrests %>%
  filter(!is.na(Weapon), Gender %in% c("M", "F"), Race !="U") %>%
  group_by(Weapon, Race, Gender) %>%
  summarise(Count = n(), .groups = "drop") %>%
  arrange(Weapon, desc(Count))

head(weapon_by_race_gender, 10)
## # A tibble: 10 × 4
##    Weapon                        Race  Gender Count
##    <chr>                         <chr> <chr>  <int>
##  1 CLUB/BLACKJACK/BRASS KNUCKLES B     M         12
##  2 CLUB/BLACKJACK/BRASS KNUCKLES W     M          7
##  3 CLUB/BLACKJACK/BRASS KNUCKLES B     F          4
##  4 CLUB/BLACKJACK/BRASS KNUCKLES W     F          4
##  5 CLUB/BLACKJACK/BRASS KNUCKLES H     M          2
##  6 FIREARM (TYP NOT STATED)      B     M         23
##  7 FIREARM (TYP NOT STATED)      B     F          5
##  8 FIREARM (TYP NOT STATED)      W     M          2
##  9 HANDGUN                       B     M        224
## 10 HANDGUN                       W     M         70
# visual 'weapon_by_race_gender'

library(stringr)

weapon_by_race_gender$Weapon <- str_wrap(weapon_by_race_gender$Weapon, width = 15)

ggplot(weapon_by_race_gender, aes(x = Race, y = Count, fill = Gender)) +
  geom_col(position = "dodge", width = 0.7) +
  facet_wrap(~ Weapon, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = c("F" = "#f8766d", "M" = "#00bfc4")) +
  labs(
    title = "Weapons Involved in Arrests by Race and Gender",
    subtitle = "Grouped by weapon type and separated by gender",
    x = "Race",
    y = "Number of Arrests",
    fill = "Gender"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    strip.text = element_text(face = "bold", size = 9, angle = 0),
    strip.text.x = element_text(margin = margin(b = 5)),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )

# visual for most common charges for the top 3 weapon types

library(ggplot2)

top_charges = Arrests %>%
  filter(!is.na(Charge)) %>%
  count(Charge, sort = TRUE) %>%
  slice_max(n, n = 15)

ggplot(top_charges, aes(x = reorder(Charge, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 15 Charges in Chapel Hill", x = "Charge", y = "Number of Arrests") +
  theme_minimal()

# fix name of one of the weapons
data = Arrests
data <- data %>%
  mutate(Weapon = str_to_upper(Weapon)) %>%
  mutate(Weapon = ifelse(Weapon == "LETHAL CUTTING INSTRUMENT", "CUTTING INSTRUMENT", Weapon))

# top 3 weapons
top_weapons <- data %>%
  filter(!is.na(Weapon)) %>%
  count(Weapon, sort = TRUE) %>%
  slice_max(n, n = 3) %>%
  pull(Weapon)

# top 5 charges each weapon
top_charges <- data %>%
  filter(Weapon %in% top_weapons, !is.na(Charge)) %>%
  group_by(Weapon, Charge) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Weapon) %>%
  slice_max(Count, n = 5) %>%
  ungroup()

top_charges$Charge <- str_wrap(top_charges$Charge, width = 20)

ggplot(top_charges, aes(x = Charge, y = Count)) +
  geom_col(fill = "steelblue") +
  facet_wrap(~ Weapon, scales = "free") +
  labs(
    title = "Top 5 Charges for the 3 Most Common Weapon Types",
    x = "Charge", y = "Number of Arrests"
  ) +
  coord_cartesian(clip = "off") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 55, hjust = 1, size = 5),
        strip.text = element_text(size = 10)
)

Interpreter: Mariana Rodriguez-Pacheco

Q2: Are any age groups (e.g., young: 18–25, middle-aged: 26–45, older: 46+) strongly associated with a specific type of charge?

age = Arrests$Age
age_group = c()

for (i in 1:length(age)) {
  if (is.na(age[i])) {
    age_group = c(age_group, "Unknown")
  } else if (age[i] <= 25) {
    age_group = c(age_group, "Young")
  } else if (age[i] <= 45) {
    age_group = c(age_group, "Middle-aged")
  } else {
    age_group = c(age_group, "Older")
  }
}

age_group = factor(age_group, levels = c("Young", "Middle-aged", "Older", "Unknown"))
police_arrests = mutate(Arrests, Age_Group = age_group)
top_10_data = police_arrests %>%
  group_by(Age_Group, Charge) %>%
  summarise(Count = n()) %>%
  mutate(Proportion = Count / sum(Count)) %>%
  arrange(Age_Group, desc(Proportion)) %>%
  mutate(Rank = min_rank(desc(Proportion))) %>%
  filter(Rank <= 10)
## `summarise()` has grouped output by 'Age_Group'. You can override using the
## `.groups` argument.
ggplot(top_10_data) +
  geom_bar(aes(x = reorder(Charge, -Proportion), y = Proportion, fill = Age_Group), stat = "identity") +
  facet_wrap(~Age_Group, nrow = 1) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6)) +
  labs(title = "Top 10 Charges by Age Group", y = "Proportion", x = "Charge")

Orator: Nicholas Copland

Q1: What percentage of incidents led to arrests? (*Note: This question is what led us to discover a lack of correlation between the datasets and abandon the incidents data alltogether)

incidents = incidents %>%
  mutate(
    Year = year(Report_Date)
  )

arrests = Arrests %>%
  mutate(
    Year = year(Arrest_Date)
  )

in1 = incidents %>%
  filter(Year %in% 2016:2020, Offense == "SHOTS FIRED" | Offense == "GUNSHOTS")

ar1 = arrests %>%
  mutate(Year = year(Arrest_Date)) %>%
  filter(Year %in% 2016:2020, 
         Weapon %in% c("HANDGUN", "RIFLE", "SHOTGUN", "OTHER FIREARM", "FIREARM (TYP NOT STATED)"))

in1 <- in1 %>% mutate(Incident_Day = as.Date(Report_Date))
ar1 <- ar1 %>% mutate(Arrest_Day = as.Date(Arrest_Date))

ar1_same_day <- ar1 %>%
  filter(Arrest_Day %in% in1$Incident_Day)

incidents_plot <- in1 %>%
  transmute(
    Lat = Latitude,
    Long = Longitude,
    Type = "Incident",
    Day = Incident_Day
  )

arrests_plot <- ar1_same_day %>%
  transmute(
    Lat = latitude, 
    Long = longitude,
    Type = "Arrest",
    Day = Arrest_Day
  )

plot_data <- bind_rows(incidents_plot, arrests_plot)

shape_values <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)

ggplot(plot_data, aes(x = Long, y = Lat, color = Type, shape = as.factor(Day))) +
  geom_point(size = 3, alpha = 1) +
  labs(
    title = "Gun Incidents & Same-Day Gun Arrests",
    x = "Longitude",
    y = "Latitude",
    color = "Type",
    shape = "Date (Same-Day Pairs)",
    caption = "Red = Incident, Blue = Arrest\nDifferent shapes = different days"
  ) +
  theme_minimal() +
  scale_shape_manual(values = shape_values) + 
  guides(
    shape = guide_legend(override.aes = list(size = 4)),
    color = guide_legend(override.aes = list(size = 4))
  )  

Update: We are no longer using the incidents data set partly because of this EDA question that revealed a few things about the nature of that data. 1. There are no incident reports for the date of the UNC shooting at the correct time 2. There is no arrest report for the date of the UNC shooting These two factors helped us to learn the incident data is a lot less comprehensive than we thought. Our theory on the shootings absence from the arrest data is that because other departments were involved CHPD didn’t report it as they didn’t make the arrest. It likely shows up in the Orange County system or maybe the State Trooper data.

Q2: How do crime patterns within Chapel Hill fluctuate in relation to the UNC academic calendar, and do certain types of crimes peak during student presence (e.g., semester sessions) or absence (e.g., breaks and holidays)?

#Creating Two Categories: Fall Semester and Spring Semester (when most students are on campus) and Breaks (When most students are off campus)
arrests = Arrests  %>%
  mutate(
    Year = year(Arrest_Date),
    Month = month(Arrest_Date),
    Day = day(Arrest_Date),
    Hour = hour(Arrest_Date)
  )

Semesters = arrests %>%
  filter(
    (Month == 1 & Day >= 7) |
    (Month %in% 2:4) |
    (Month == 5 & Day <= 10) |
    (Month == 8 & Day >= 15) |
    (Month %in% 9:11) |
    (Month == 12 & Day <= 14)
  )

Breaks = arrests %>%
  filter(
    !((Month == 1 & Day >= 7) |
    (Month %in% 2:4) |
    (Month == 5 & Day <= 10) |
    (Month == 8 & Day >= 15) |
    (Month %in% 9:11) |
    (Month == 12 & Day <= 14)))
OnBreak = arrests %>%
  mutate(OnBreak = (ifelse(Object_Id %in% Breaks$Object_Id, "On Break", "Semester"))) %>%
  mutate(YearMonth = as.Date(paste(Year, Month, "01", sep = "-"))) %>%
  group_by(YearMonth, OnBreak) %>%
  summarise(Count = n(), .groups = "drop")

ggplot(OnBreak, aes(x = YearMonth, y = Count, fill = OnBreak)) +
  geom_col(position = "stack") +
  labs(
    title = "Monthly Arrest Counts: Semester vs. Break",
    x = "Month",
    y = "Number of Arrests",
    fill = "Status"
  ) +
  theme_minimal()

Orator: Christian Hinson

Load and Prepare Data

# Load dataset
#print(colnames(arrests))

arrests <- Arrests %>%
  filter(Drugs_Alcohol %in% c("Y", "N")) %>%
  mutate(
    Substance = if_else(Drugs_Alcohol == "Y", "Substance Present", "No Substance"),
    arrest_datetime = ymd_hms(Arrest_Date),
    hour = hour(arrest_datetime),
    day = day(arrest_datetime),
    month = month(arrest_datetime, label = TRUE, abbr = FALSE),
    weekday = wday(arrest_datetime, label = TRUE, abbr = FALSE)
  )

Arrests by Hour of Day

# Clean and transform data
arrests <- arrests %>%
  filter(Drugs_Alcohol %in% c("Y", "N")) %>%
  mutate(
    Substance = if_else(Drugs_Alcohol == "Y", "Substance Present", "No Substance"),
    Arrest_Date = ymd_hms(Arrest_Date),
    hour = hour(Arrest_Date)
  )

# Summarize by hour
hourly_counts <- arrests %>%
  count(hour, Substance) %>%
  group_by(Substance) %>%
  mutate(Proportion = n / sum(n))

# Plot
ggplot(hourly_counts, aes(x = hour, y = Proportion, fill = Substance)) +
  geom_col(position = "dodge") +
  labs(
    title = "Proportion of Arrests by Hour of Day",
    x = "Hour of Day (0 = Midnight, 23 = 11PM)",
    y = "Proportion of Arrests",
    fill = "Substance Involvement"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_continuous(breaks = 0:23) +
  theme_minimal()

Arrests by Day of Week

# Summarize by day of week
weekday_counts <- arrests %>%
  mutate(day_of_week = wday(Arrest_Date, label = TRUE, abbr = FALSE)) %>%
  count(day_of_week, Substance) %>%
  group_by(Substance) %>%
  mutate(Proportion = n / sum(n))

ggplot(weekday_counts, aes(x = day_of_week, y = Proportion, fill = Substance)) +
  geom_col(position = "dodge") +
   geom_text(
    aes(label = scales::percent(Proportion, accuracy = 0.1)),
    position = position_dodge(width = 0.9),
    vjust = -0.5,
    size = 3.5
  ) +
  labs(
    title = "Arrest Proportions by Day of Week and Substance Presence",
    x = "Day of Week",
    y = "Proportion of Arrests",
    fill = "Substance"
  ) +
  scale_y_continuous(labels = percent_format())

Interpretation

The analysis reveals a clear temporal pattern in substance-involved arrests within Chapel Hill. In particular there is a significant increase in the proportion of arrests involving substances on the days of Friday, Saturday, Sunday. Even more specific there is a substantial concentration of arrests between the times 7pm - 3am. This strongly correlates with the typical nightlife activity in the area. This indicates that law enforcement should strategize to be more proactive and aware of these elevated occurence substance related incidents during this time and provide sufficient prevention.

Q2: How does the presence of drugs or alcohol at the time of arrest (as recorded in the Drugs_or_Alcohol_Present column) vary by gender, and which gender shows a higher proportion of substance-involved arrests?

# Filter Drugs_Alcohol and remove unknowns
arrests <- Arrests %>%
  filter(
    !is.na(Gender), Gender %in% c("M", "F"),           # keep only Male and Female
    !is.na(Drugs_Alcohol), Drugs_Alcohol %in% c("Y", "N")  # keep only valid substance values
  ) %>%
  mutate(
    Substance = ifelse(Drugs_Alcohol == "Y", "Substance Present", "No Substance")
  )

### Summarize Proportions by Gender


# Filter out unknowns if needed, then summarize
q8_summary <- arrests %>%
  group_by(Gender, Substance) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Gender) %>%
  mutate(Proportion = Count / sum(Count))

Visualization

ggplot(q8_summary, aes(x = Gender, y = Proportion, fill = Substance)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = scales::percent(Proportion, accuracy = 0.1)),
            position = position_dodge(width = 0.9),
            vjust = -0.5, size = 3.5) +
  scale_fill_manual(
    values = c("Substance Present" = "skyblue", "No Substance" = "red")
  ) +
  labs(
    title = "Proportion of Substance-Involved Arrests by Gender",
    x = "Gender",
    y = "Proportion of Arrests",
    fill = "Substance Involvement"
  ) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal()


Interpretation

This plot shows the proportion of arrests involving drugs or alcohol for each gender. The data helps us identify whether substance-involved arrests are more common among one gender relative to their total arrests in Chapel Hill. Among all arrests 66.6% of male arrests involved drugs or alcohol, compared to 54.8% of Female arrests having substance involvement. This proportion reveals that substance involvement was more prevalent among males. This difference could reflect variation the types of offenses committed or a difference in how police handle males and females that are under the influence of substances.

Substance involved arrests account for more than half of Female arrests and two thirds of Male arrests indicating that drugs or alcohol play a significant role in the majority of arrest events in Chapel Hill. This implies a strong correlation between substance presence and events leading to an arrest. These findings have implications for substance abuse prevention programs and substance abuse based police training.

Potential further questions based on this investigation. 1. Can we predict the likelihood of a substance involved arrest based on gender, time of day, age, and use this to deploy substance abuse prevention resources.

  1. How can trends in subtance based arrests be used to predict emerging public health challenges

Deliverer: Emmett Leo

Q1: How do ethnicity and race correlate with the type of arrest? Is the racial profile of “on view” arrests different from those arrested via other means?

Arrests %>%
  group_by(Arrest_Type, Race) %>% 
  summarize(n=n()) %>%
  inner_join(summarize(group_by(Arrests,Race),nr=n())) %>%
  mutate(prop=n/nr) %>%
  ggplot() + geom_tile(aes(x=Race,y=Arrest_Type,fill=prop))
## `summarise()` has grouped output by 'Arrest_Type'. You can override using the
## `.groups` argument.
## Joining with `by = join_by(Race)`

Q2: How are different arrest types concentrated, by latitude and longitude? Are “on view” arrests more concentrated in the same areas that police are inclined to actively police, whereas other arrest types are more spread out, and is there even a significant difference between the two?

Arrests %>%
  group_by(Arrest_Type) %>%
  summarize(av_lon=mean(longitude), av_lat=mean(latitude),av_dev=mean(sqrt((longitude-av_lon)^2+(latitude-av_lat)^2)))
## # A tibble: 3 × 4
##   Arrest_Type                     av_lon av_lat av_dev
##   <chr>                            <dbl>  <dbl>  <dbl>
## 1 ON VIEW                          -79.0   35.9 0.0225
## 2 SUMMONED/CITED                   -79.1   35.9 0.0176
## 3 TAKEN INTO CUSTODY (WARRANT/LP)  -79.0   35.9 0.0200
type_grid = Arrests %>%
  mutate(rlat=round(latitude*200)/200, rlon=round(longitude*200)/200) %>%
  group_by(rlat,rlon) %>%
  summarize(on_view_prop=sum(Arrest_Type=='ON VIEW')/n(), n=n()) %>%
  filter(1.96*sqrt(on_view_prop*(1-on_view_prop)/n)<.15&on_view_prop>0&on_view_prop<1) # Margin in error is less than .15
## `summarise()` has grouped output by 'rlat'. You can override using the
## `.groups` argument.
ggplot() +
  geom_tile(data=type_grid, aes(x=rlon, y=rlat,fill=on_view_prop)) +
  #geom_point(data= arrests, aes(x=longitude, y=latitude, color=Arrest_Type), alpha = .1) +
  geom_path(data = ch, aes(x=`V1`, y=`V2`)) +
  coord_quickmap(xlim=c(-79.09,-78.99), ylim=c(35.86,35.99)) 

Follow-up Questions

New Questions Based Off Initial Investigation

  • Q1: What factors might explain why trespassing charges are more common among older individuals?
  • Q2: Do institutional locations like schools appear to influence where alcohol and drug-related arrests are recorded?
  • Q3: Do the characteristics of arrests in the “Unknown” age group and their concentration at the Chapel Hill Police Department suggest that these cases involve minors, and what might explain the high frequency of alcohol and drug related charges at this location?
  • Q4: Why is there a sharp drop in total arrests in 2021 compared to other years, does that correlate with the pandemic?

Investigation of Follow-up Questions

We decided to investigate Q3 and Q4 in further detail.

Question 3

uknown_age_data = police_arrests %>%
  filter(is.na(Age)) %>%
  group_by(Street, latitude, longitude) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count)) %>%
  head(20)
## `summarise()` has grouped output by 'Street', 'latitude'. You can override
## using the `.groups` argument.
leaflet() %>%
  addTiles() %>%
  setView(lng = -79.0558, lat = 35.9132, zoom = 13) %>%
  addMarkers(lng = -79.0558, lat = 35.9132, popup = "Chapel Hill, NC") %>%
  addCircleMarkers(lng = uknown_age_data$longitude, lat = uknown_age_data$latitude, radius = uknown_age_data$Count / 2)
police_department_data = filter(police_arrests, Street == "828 MARTIN LUTHER KING JR BLVD", Charge == "CONSUME ALCOHOLIC BEVERAGE LESS THAN 21" | str_detect(Charge, "DRUG"))

ggplot(police_department_data)+
  geom_bar(aes(x = Arrest_Type))


Question 4

#checking why no spring and summer in 2020
# List of all years & months in your data

all_months <- month.abb

# All year-month combinations
all_combinations <- expand_grid(
  year = unique(Arrests$year),
  month = all_months
)

# Actual counts
actual_counts <- Arrests %>%
  count(year, month)

# Convert both month columns to character explicitly
all_combinations <- all_combinations %>%
  mutate(month = as.character(month))

actual_counts <- actual_counts %>%
  mutate(month = as.character(month))

# Left join + fill missing with 0
complete_counts <- all_combinations %>%
  left_join(actual_counts, by = c("year", "month")) %>%
  mutate(n = replace_na(n, 0)) %>%
  arrange(year, match(month, all_months))

# Show missing months
missing_months <- complete_counts %>% filter(n == 0)

print(missing_months)
## # A tibble: 20 × 3
##     year month     n
##    <dbl> <chr> <int>
##  1  2021 Apr       0
##  2  2021 May       0
##  3  2021 Jun       0
##  4  2021 Jul       0
##  5  2021 Aug       0
##  6  2021 Sep       0
##  7  2024 May       0
##  8  2024 Jun       0
##  9    NA Jan       0
## 10    NA Feb       0
## 11    NA Mar       0
## 12    NA Apr       0
## 13    NA May       0
## 14    NA Jun       0
## 15    NA Jul       0
## 16    NA Aug       0
## 17    NA Sep       0
## 18    NA Oct       0
## 19    NA Nov       0
## 20    NA Dec       0
missing_months %>%
  kable("html", align = "lccrr") %>%
  kable_styling(full_width = TRUE)
year month n
2021 Apr 0
2021 May 0
2021 Jun 0
2021 Jul 0
2021 Aug 0
2021 Sep 0
2024 May 0
2024 Jun 0
NA Jan 0
NA Feb 0
NA Mar 0
NA Apr 0
NA May 0
NA Jun 0
NA Jul 0
NA Aug 0
NA Sep 0
NA Oct 0
NA Nov 0
NA Dec 0
# Prepare data
daily_arrests <- Arrests %>%
  filter(!is.na(ArrestDate)) %>%
  count(ArrestDate) %>%
  mutate(year = lubridate::year(ArrestDate))

# Get years
years <- sort(unique(daily_arrests$year))

# Loop: save PNG for each year
for (yr in years) {
  cat("Rendering year:", yr, "\n")
  
  p <- ggplot_calendar_heatmap(
      daily_arrests %>% filter(year == yr),
      'ArrestDate',
      'n'
    ) +
    scale_fill_gradient(low = "white", high = "red", name = "Arrests", na.value = "grey") +
    labs(
      title = sprintf("Calendar Heatmap of Daily Arrests — Year: %s", yr),
      subtitle = "Darker colors indicate more arrests, grey = missing data"
    ) +
    theme_minimal(base_size = 14)
  
 # ggsave(sprintf("calendar_%s.png", yr), plot = p, width = 12, height = 8)
}
## Rendering year: 2010 
## Rendering year: 2011 
## Rendering year: 2012 
## Rendering year: 2013 
## Rendering year: 2014 
## Rendering year: 2015 
## Rendering year: 2016 
## Rendering year: 2017 
## Rendering year: 2018 
## Rendering year: 2019 
## Rendering year: 2020 
## Rendering year: 2021 
## Rendering year: 2022 
## Rendering year: 2023 
## Rendering year: 2024
imgs <- list.files(
  path = "heat map calendar for each year",
  pattern = "\\.png$",
  full.names = TRUE
) |> sort()

length(imgs)
## [1] 0
output_gif <- "calendar_heatmap_animation.gif"

#gifski(
#  png_files = imgs,
#  gif_file = output_gif,
#  delay = 2,   # adjust speed (seconds per frame)
#  width = 1200,
#  height = 800,
#  loop = TRUE
#)

Summary

Arrests numbers have generally fallen year-over-year since 2009, most drastically during the pandemic. Somewhat counterintuitively, arrest numbers are generally lower in the summer months and school breaks, despite the presence of fewer students to arrest. Arrests are concentrated in population centers, along 86 and Franklin Street and in downtown Chapel Hill, and the two relevant zip codes. Underage drinking and drug related arrests are most concentrated in downtown Chapel Hill and Franklin Street, as well as at the police department. DWI is a common charge, especially for the young, while open container and failure to appear charges surprisingly inscrease in frequency with age group. Police data on reported incidents turned out to have very little relationship with the arrest data, with only 24 same-day pairs of gun-related incident reports and gun-related arrests in the entire dataset. Substance involved arrests are, predictably, more likely late at night and on weekends, and offenders are more likely to be men than in other arrests. Arrests of racially Hispanic and Asian offenders are more likely to be on-site arrests, while black offenders were likely to be taken into custody. On-view arrests were geographically well-distributed but much less common on the UNC campus.

To explore these patterns further, we investigated whether the “Unknown” age group may include minors and why there was a high concentration of drug and alcohol charges at the Chapel Hill Police Department’s address. We found that the “Unknown” group had a high number of underage drinking and marijuana-related charges, and these records often lacked details like age, race, and gender. Many of these arrests also occurred near schools or the police department, suggesting that they may involve minors whose information is legally protected. Additionally, examining arrests at the police department revealed that most were labeled as “Summoned/Cited” rather than physical detainments. This implies that the address may reflect the location where citations were processed, not necessarily where offenses occurred. Together, these findings suggest that a combination of data protection practices and recording conventions could explain both the anonymity in age and the arrest clustering at the police station. From the first plot investigating question 4, the sharp decline in arrests in 2021 coincides with missing data for several months of that year. According to the missing-months table, arrests were not recorded (or records are missing) for months such as April–September 2021. Therefore, the drop is at least partially attributable to incomplete data rather than a real drop in arrest activity. And from the animated heatmap, We can see the trends in arrests over the previous 15 years by looking at the calendar heatmap GIF, which cycles through all years and months in a chronological manner. The heatmaps’ color saturation shows the number of arrests; more arrests are indicated by darker red. Most months have at least some activity over the years, but the red’s intensity varies, indicating variations in the number of arrests over time. Many months appear grey, indicating missing data for those periods, especially during COVID years. The dramatic decline in annual arrests during the trend years is explained by this visual evidence. The animation illustrates the annual and seasonal trends in arrests while also drawing attention to the data gaps